Abstract


Dropout is typically interpreted as bagging a large number of models sharing parameters.
We show that using dropout in a network can also be interpreted as
a kind of data augmentation in the input space without domain knowledge. We 
present an approach to projecting the dropout noise within a network back into 
the input space, thereby generating augmented versions of the training data, and 
we show that training a deterministic network on the augmented samples yields 
similar results. Finally, we propose a new dropout noise scheme based on our 
observations and show that it improves dropout results without adding significant 
computational cost. 


1 Introduction 


Noise is normally seen as intrinsically undesirable. The word itself bears a very negative connotation.
It is not surprising then that many early mathematical models in neuroscience aimed to factor
out noise by any means. A few decades ago, the use of stochastic resonance (Wiesenfeld et al., 1995)
in neuroscientific models sparked new interest in neuroscience regarding random fluctuations and
the role they play in the brain. Theories about neuronal noise are now flourishing and previous
deterministic models are improved by the incorporation of noise (Yarom & Hounsgaard, 2011).

Biological brains have always been a strong inspiration when it comes to developing learning algorithms.
Considering the amount of noise which takes place in the brain during learning, one
can wonder if this has any beneficial effect. Many techniques in machine learning have recently made use
of noise to improve performance, namely Denoising Autoencoders (Vincent et al., 2008),
dropout (Hinton et al., 2012) and its relative, DropConnect (Wan et al., 2013). Those successful approaches
suggest that neuronal noise plays a fundamental role in the process of learning and should
be studied more thoroughly.

Using dropout can be viewed as training a huge number of neural networks with shared parameters 
and applying bagging at test time for better generalization (Baldi & Sadowski, 2013). Binary noise 
can also be viewed as preventing neurons from co-adapting, which improves the generalization of 
the model even more. In this paper, we propose an alternative view and suggest noise schemes like 
dropout are implicitly incorporating a form of sophisticated data augmentation. In Section 3, we 
formulate a method to generate data which replicates dropout noise within a deterministic network, 
and demonstrate in Section 5 that there is no significant loss of accuracy. 

Finally, capitalizing on the idea of data augmentation, we present in Section 4 an extension of dropout
which uses random noise levels to improve the variety of samples. This simple extension improves 
classification performance across different network architectures, yielding competitive results on the 
MNIST permutation invariant classification task. 


*Both authors contributed equally


2 Dropout

The main goal when using dropout is to regularize the neural network we are training. The technique
consists of dropping neurons randomly with some probability p. Those random modifications of
the network’s structure are believed to avoid co-adaptation of neurons by making it impossible for
two subsequent neurons to rely solely on each other (Srivastava et al., 2014). The most accepted
interpretation of dropout is that it implicitly bags, at test time, a large number of neural networks
which share parameters.

Assume $h(x)$ is a linear projection of a $d_i$-dimensional input $x$ into a $d_h$-dimensional space:

$$h(x) = xW + b \tag{1}$$

Let $a(h)$ and $\tilde{a}(h)$ be an activation function and its noisy version, where $M \sim \mathcal{B}(p_h)$ and
$\mathrm{rect}(h)$ is a rectifier:

$$a(h) = \mathrm{rect}(h) \tag{2}$$

$$\tilde{a}(h) = M \odot \mathrm{rect}(h) \tag{3}$$

Eq. 3 denotes the activation with dropout during training and Eq. 2 the activation at
test time. Srivastava et al. (2014) suggest scaling the activations $a(h)$ by $p$ at test time to get an
approximate average of the unit activation.


3 From a data augmentation perspective 

Many previous works have shown that augmenting data with domain-specific transformations
helps in learning better models (LeCun et al., 1998; Simard et al., 2003; Krizhevsky et al.,
2012; Ciresan et al., 2012). In this work, we analyze dropout in the context of data augmentation.
Considering the task of classification, given a set of training samples, the objective is to
learn a mapping function which maps every input to its corresponding output label. To generalize,
the mapping function needs to be able to correctly map not just the training samples but also any
other samples drawn from the data distribution. This means that it must not only map input space
sub-regions represented by training samples, but all high-probability sub-regions of the natural distribution.

One way to learn such a mapping function is by augmenting the training data such that it covers a
larger portion of the natural distribution. Domain-based data augmentation helps to artificially boost
training data coverage, which makes it possible to train a better mapping function. We hypothesize
that noise-based regularization techniques have a similar effect of increasing training data coverage
at every hidden layer, and this work presents multiple experimental observations to support our
hypothesis.

3.1 Projecting noise back into the input space 
We assume that for a given $\tilde{a}(h)$, there exists an $x^*$ such that

$$(a \circ h)(x^*) = \mathrm{rect}(h(x^*)) \approx M \odot \mathrm{rect}(h(x)) = (\tilde{a} \circ h)(x) \tag{4}$$

Similarly to adversarial examples from Goodfellow et al. (2014b), an $x^*$ can be found by minimizing
the squared error $L$ using stochastic gradient descent:

$$L(x, x^*) = \left| (a \circ h)(x^*) - (\tilde{a} \circ h)(x) \right|^2 \tag{5}$$
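
The minimization of Eq. 5 can be sketched for a single rectifier layer as follows. This is a minimal
illustration under stated assumptions, not the procedure used in the experiments: it uses plain
full-batch gradient descent, initializes x* at the clean input x, and the function names, the step
size `lr`, the number of `steps` and the layer sizes are all hypothetical choices.

```python
import numpy as np

def rect(h):
    return np.maximum(h, 0.0)

def loss(x_star, W, b, target):
    """Eq. 5: squared error between the clean activation at x* and the noisy target."""
    return np.sum((rect(x_star @ W + b) - target) ** 2)

def back_project(x, W, b, mask, lr=0.01, steps=200):
    """Search for x* such that rect(h(x*)) is close to mask * rect(h(x))."""
    target = mask * rect(x @ W + b)       # noisy activation computed from the clean input
    x_star = x.copy()                     # assumed initialization: start at the clean input
    for _ in range(steps):
        h_star = x_star @ W + b
        residual = rect(h_star) - target
        # Gradient of the squared error w.r.t. x*, passing through the rectifier.
        grad = 2.0 * (residual * (h_star > 0)) @ W.T
        x_star -= lr * grad
    return x_star, target

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 20))
W = rng.normal(size=(20, 10)) / np.sqrt(20)
b = np.zeros(10)
mask = rng.binomial(1, 0.5, size=(1, 10))

x_star, target = back_project(x, W, b, mask)
loss_before = loss(x, W, b, target)       # error when feeding the clean input
loss_after = loss(x_star, W, b, target)   # error when feeding the back-projected input
```

Comparing `loss_before` with `loss_after` indicates how well the deterministic layer, fed with x*,
reproduces the noisy activation obtained from x.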

Equation 5 can be generalized to a network with $n$ hidden layers. To lighten notation we first define

$$f^{(i)}(x) = a^{(i)} \circ h^{(i)} \circ \cdots \circ a^{(1)} \circ h^{(1)}(x) \tag{6}$$

$$\tilde{f}^{(i)}(x) = \tilde{a}^{(i)} \circ h^{(i)} \circ \cdots \circ \tilde{a}^{(1)} \circ h^{(1)}(x) \tag{7}$$

We can now compute the back projection corresponding to all hidden layer activations at once,
which results in minimizing the loss $L$

$$L\left(x, x^{(1)*}, \ldots, x^{(n)*}\right) = \sum_{i=1}^{n} \left| f^{(i)}\left(x^{(i)*}\right) - \tilde{f}^{(i)}(x) \right|^2 \tag{8}$$
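
To make the multi-layer objective concrete, the sketch below evaluates it for a small two-hidden-layer
rectifier network. It assumes that the loss is the unweighted sum over layers of the squared error
between the clean activation of layer i at x^(i)* and the noisy activation of layer i at x; the helper
names and layer sizes are hypothetical, and no optimization is performed here.

```python
import numpy as np

def rect(h):
    return np.maximum(h, 0.0)

def forward(x, layers, masks=None):
    """Return the activations of every layer; if masks are given, apply them (noisy version)."""
    outputs, a = [], x
    for i, (W, b) in enumerate(layers):
        a = rect(a @ W + b)
        if masks is not None:
            a = masks[i] * a
        outputs.append(a)
    return outputs

def multilayer_loss(x, x_stars, layers, masks):
    """Sum over layers of |clean layer-i activation at x_stars[i] - noisy layer-i activation at x|^2."""
    noisy = forward(x, layers, masks)
    total = 0.0
    for i, x_star in enumerate(x_stars):
        clean_i = forward(x_star, layers[: i + 1])[-1]
        total += np.sum((clean_i - noisy[i]) ** 2)
    return total

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 6)), np.zeros(6)),
          (rng.normal(size=(6, 4)), np.zeros(4))]
x = rng.normal(size=(1, 8))

ones = [np.ones((1, 6)), np.ones((1, 4))]
masks = [rng.binomial(1, 0.5, size=(1, 6)), rng.binomial(1, 0.5, size=(1, 4))]

loss_identity = multilayer_loss(x, [x, x], layers, ones)   # no noise: x itself is a solution
loss_noisy = multilayer_loss(x, [x, x], layers, masks)     # starting point before optimizing
```

With masks of ones the loss vanishes at x^(i)* = x, which is a quick sanity check; with sampled masks,
`loss_noisy` is the value a gradient-based search over each x^(i)* would start from.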


We can show by contradiction that one is unlikely to find a single $x^* = x^{(1)*} = \ldots = x^{(n)*}$
that significantly reduces $L$. The proof is detailed in appendix subsection 8.1. Fortunately, it is easy
to find a different $x^*$ for each hidden layer, by providing multiple inputs $(x, x^{(1)*}, x^{(2)*}, \ldots, x^{(n)*})$,
where $n$ is the number of hidden layers. As each $x^{(i)*}$ is the back projection of a transformation
in the representation space defined by the $i$-th hidden layer, it suggests viewing dropout as a sophisticated
data augmentation procedure that samples data around training examples with respect to
different levels of representation. This raises the question of whether we can train the network
deterministically on the $x^{(i)*}$ rather than using dropout. The answer is not trivial, because

1. When using $(x, x^{(1)*}, x^{(2)*}, \ldots, x^{(n)*})$ as inputs, dropout is not effectively applied to every
   layer at the same time. The local stochasticity preventing co-adaptation is then present at
   a specific layer only once for every $x^{(i)*}$. This may not be aggressive enough to avoid
   co-adaptation.

2. The gradients of the linear projections will differ greatly. In the case of dropout, $\partial h^{(i)} / \partial W^{(i)}$ is
   always equal to its input transformation, i.e. $\tilde{f}^{(i-1)}(x)$, whereas the deterministic version
   of the training will update $W^{(i)}$ according to $f^{(i-1)}(x^{(1)*}), \ldots, f^{(i-1)}(x^{(n)*})$.

Although we showed that a single x* minimizing Equation 8 is difficult to find for a large network, we show
experimentally in Section 5 that it is possible to do so within reasonable approximation for a relatively
small network with two hidden layers. We further show that dropout can be replicated by projecting the
noise back on the input space without a significant loss of accuracy.


4 Improving the set of augmentations

When dealing with domain-based transformations, we intuitively look for the richest set of transformations.
In computer vision for instance, translations, rotations, scalings, shearings and elastic
transformations are often combined. Looking at dropout from a data augmentation perspective, this
intuition raises the following question: given that the noise scheme used implicitly applies some
transformations in the input space, which one would produce the richest set of transformations?

With noise schemes like dropout, there are two important components which influence the transformations:
the probability distribution of M and the features of the neural network used to encode
h(x). Modifying the probability distribution is the most straightforward way to improve the set of
transformations and will be the main focus of this paper. However, features of the neural network
play a key role in the transformations and we will outline some possible avenues in the conclusion
section.

4.1 Random noise levels 

While using dropout, the proportion of neurons dropped is very close to the probability p. This follows
naturally from the expectation of the Binomial distribution. The transformations induced are as different as
the values M can take. Despite this, their magnitude is as constant as the proportion of neurons
dropped. That means that every transformation displaces the sample by a relatively constant distance,
but in random directions in a high-dimensional space.

¹ Because we train on n samples from x, one for each hidden layer

[Figure 1 plot: probability density function on the proportion of neurons dropped with p = 0.5.
x-axis: proportion of dropped neurons (k).]

Figure 1: The density function on the proportion of neurons dropped is very peaked around p for
dropout. This results in a low variance of the magnitude of the transformations induced by dropout.
Random dropout, on the other hand, has a constant density function under pN, N being the number of
neurons, because the noise level is drawn as $\rho \sim \mathcal{U}(0, p)$. It is thus more likely to see
transformations closer to identity, while this is very unlikely for standard dropout. This is intuitively
desirable, in the same way that varying translation distances is preferable to constant large distances.

A simple way to vary the transformation magnitude randomly is to replace p by a random variable.
Let $\rho_h \sim \mathcal{U}(0, p_h)$ and $M_{hij} \sim \mathcal{B}(\rho_h)$, where $h$ defines the layer, $i$ the sample, and $j$ the layer’s
neuron. It is important to use the same $\rho$ for all neurons of a layer, otherwise we would have
$M_{hij} \sim \mathcal{B}(p_h / 2)$.
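
The reason a shared noise level matters can be made explicit with a short worked equation. Suppose,
hypothetically, that an independent level $\rho_{hij} \sim \mathcal{U}(0, p_h)$ were drawn for every neuron
(the symbol $\rho_{hij}$ is introduced here only for this illustration). Marginalizing over the level gives

$$P(M_{hij} = 1) = \mathbb{E}\left[\rho_{hij}\right] = \frac{1}{p_h} \int_0^{p_h} \rho \, d\rho = \frac{p_h}{2}$$

so the mask entries would again be independent Bernoulli variables with a fixed parameter, and the
proportion of affected neurons in a layer would concentrate around $p_h / 2$ exactly as in standard
dropout. Sharing one level across the layer is what lets this proportion vary from draw to draw.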

To compensate for the change of level of activations during test, a scaling is normally applied. One
could also simply apply the inverse scaling during training, turning Equation 3 into

$$\tilde{a}(h) = \frac{1}{1 - p} M \odot \mathrm{rect}(h) \tag{9}$$

To adapt the equation to random dropout levels, we simply need to replace $p$ with the random level $\rho$:

$$\tilde{a}(h) = \frac{1}{1 - \rho} M \odot \mathrm{rect}(h) \tag{10}$$

No scaling needs to be done during test anymore.
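
A possible NumPy sketch of standard and random dropout with inverse scaling is given below. It rests
on assumptions that should be kept in mind: p is taken to be the probability that a unit is dropped
(which is the reading under which the 1/(1 - p) factor preserves the expected activation), one noise
level is drawn per layer and per call, and the function names and sizes are illustrative only.

```python
import numpy as np

def rect(h):
    return np.maximum(h, 0.0)

def dropout(h, p, rng):
    """Drop each unit with probability p and rescale by 1 / (1 - p) (assumed reading of Eq. 9).

    Returns the noisy activation and the proportion of units actually dropped.
    """
    keep = rng.random(h.shape) >= p
    return keep * rect(h) / (1.0 - p), 1.0 - keep.mean()

def random_dropout(h, p_max, rng):
    """Draw one level rho ~ U(0, p_max) for the whole layer, then apply dropout with it (Eq. 10)."""
    rho = rng.uniform(0.0, p_max)
    return dropout(h, rho, rng)

rng = np.random.default_rng(0)
h = rng.normal(size=1000)                 # pre-activations of a 1000-unit layer

fixed = [dropout(h, 0.5, rng)[1] for _ in range(2000)]
rand = [random_dropout(h, 0.5, rng)[1] for _ in range(2000)]
std_fixed, std_random = np.std(fixed), np.std(rand)
```

In this sketch the spread of the dropped proportion, `std_random`, is much larger than `std_fixed`,
which mirrors the contrast between the peaked and the flat densities described for Figure 1.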

Figure 1 shows the differences between density function on proportion of neurons dropped for 
dropout and random dropout. Transformations induced by random dropout are clearly more diverse 
than those induced by dropout. 
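
The difference in diversity can be quantified with a short calculation, given here as a sketch under
the assumption that each of the $N$ neurons is dropped independently given the noise level. For
standard dropout with fixed level $p$, the number of dropped neurons $k$ is Binomial, so

$$\mathrm{Var}\left(\frac{k}{N}\right) = \frac{p(1 - p)}{N}$$

For random dropout with $\rho \sim \mathcal{U}(0, p)$, the law of total variance gives

$$\mathrm{Var}\left(\frac{k}{N}\right) = \frac{\mathbb{E}[\rho(1 - \rho)]}{N} + \mathrm{Var}(\rho) = \frac{1}{N}\left(\frac{p}{2} - \frac{p^2}{3}\right) + \frac{p^2}{12}$$

With $p = 0.5$ and $N = 1000$, the standard deviation of the dropped proportion is about $0.016$ for
standard dropout and about $0.145$ for random dropout. The second term, $p^2 / 12$, does not shrink
with $N$, which is why the magnitude of the induced transformations stays varied even for wide layers.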


5 Experiments 

5.1 Visualizations of noise projected back into the input space 

Visualizing the noise projected back into the input space helps to understand what kind of transformations
are induced by dropout. Unsupervised models learn more general features than supervised
fully connected neural networks and thus produce more visually appealing transformations. For this
reason, we trained autoencoders with dropout on the hidden layer to generate samples of transformations.

The autoencoder is very similar to denoising autoencoders; the only difference is that a Bernoulli
mask is applied to the hidden activations rather than to the input. There is thus no noise applied
to the input explicitly. Models are trained for 300 epochs, with a mini-batch size of 100, p = 0.4, a
learning rate of 0.001 on MNIST and 0.0001 on CIFAR-10, and a momentum of 0.7 on MNIST and 0.5 on
CIFAR-10. For CIFAR-10, we preprocess the data with PCA dimensionality reduction and retain 781
features.

[Figure 2 panels: (a) MNIST, (b) CIFAR]

Figure 2: Visualization of noisy samples on MNIST and CIFAR-10. The first column represents
original samples from the MNIST and CIFAR-10 datasets. Each row contains samples derived from the same
original sample. Every other column represents noisy samples produced by back-projecting the
noise into the input space. Each one is induced by a different noise mask, i.e. a different value of M.

Figure 3: Visualization of the influence of the features on the transformations induced by dropout on
MNIST. The first column represents original samples from the MNIST dataset. The second column represents
the five most active features on each given original sample. The last column represents noisy samples
produced by back-projecting the noise into the input space while keeping the selected feature shut
off to 0. We can see that the features are not simply removed from the input, but rather destroyed in
such a way that other features highly dependent on the same subregion still have the same activation
level.

Once the model is trained, we use gradient descent to compute x* as described in Section 3.1. We iterate
for 10 epochs with a learning rate of 100 for both MNIST and CIFAR-10. Figure 2 shows how
close the x* are to the natural input space, and we clearly see that the classes are still distinguishable.

To help understand how each feature influences the transformation, we isolate the five most active features
for a given input and shut them off, each separately. For each feature shut off, we compute x*
using gradient descent. Figure 3 shows the results found for MNIST. One could think the features
dropped by the noise are simply removed from the input. It turns out that removing the feature would
affect the activation of other neurons. Because of this, features are instead destroyed in the input in
such a way that other features highly dependent on the same subregion still have the same activation
level.

Figure 4: Error percentages on the MNIST and CIFAR-10 datasets using MLP architectures trained
with different corruption schemes. Row 1: experiments using dropout. Row 2: experiments using
noise back projection. Column 1: CIFAR-10. Column 2: MNIST.


5.2 Equivalence of dropout and noisy samples 

We ran a series of experiments with fully connected feed forward neural networks on the MNIST
and CIFAR-10 datasets to study the effect of replacing dropout by corresponding noisy inputs. Each
network consists of two hidden layers with rectified linear units followed by a softmax layer. We 
experimented with four different architectures each one with a different number of units in the hidden 
layers: 2500-2500, 2500-1250, 1500-1500 and 1500-750. 

The MNIST dataset consists of 60000 training samples and 10000 test samples. We split the training 
set into a set of 50000 samples for training and 10000 for validation. Each network is trained for 
501 epochs and the best model based on validation error is selected. The best model is then further
trained on the complete training set of 60000 samples (training and validation splits) for another 150
epochs. The mean error percentage for the last 75 epochs is used for comparison. 

We also ran experiments on the CIFAR-10 permutation invariant task using the same network architectures
described above. The dataset consists of 50000 training samples and 10000 test samples.
We use PCA-based dimensionality reduction without whitening as the preprocessing scheme, retaining
781 features. We used the same approach as in the MNIST experiments to train the networks and
to report performance.

At each epoch, an x* is generated for each training sample. It proved possible to find good
x* approximations for the entire network at once for a network with two hidden layers. Thus, we trained on
x and x* only, rather than on x, $x^{(1)*}$ and $x^{(2)*}$, as it gave a significant speed-up. For simplicity, the
network is trained on x for an epoch, then on x* for an epoch. All x* are generated with the parameter
values of the model at the beginning of the epoch.

Noisy inputs x* are found using stochastic gradient descent. 20 learning steps are done with a 
learning rate of 300.0 for first hidden layer and 30 for second hidden layer. The results for these 
experiments are shown in figure 4. 

Results suggest that dropout can be replicated by projecting the noise back into the input and training
a neural network deterministically on this generated data. There is no significant drop in accuracy;
it is even slightly better than dropout in the case of CIFAR-10. This supports the idea that
dropout can be seen as data augmentation.

[Figure 5 panel label: (b) CIFAR]

Figure 5: Visualization of noisy samples from random dropout on MNIST and CIFAR-10. The
first column represents original samples from the MNIST and CIFAR-10 datasets. Each row contains
samples derived from the same original sample. Every other column represents noisy samples produced by
back-projecting the noise into the input space. Each one is induced by a different noise mask and
a different noise level, i.e. a different value of p and M. The transformations from Figure 2 and
this one are clearly different. Random dropout applies transformations with different strengths, i.e.,
the transformed input can be anywhere from very close to very far from the original input, while standard
dropout always applies transformations with the same strength.

5.3 Richer noise schemes 

We ran a series of experiments with fully connected feed forward neural networks on the MNIST and
CIFAR-10 datasets to compare dropout with random dropout. Each network consists of two hidden layers
with rectified linear units followed by a softmax layer. We experimented with three different network
architectures, each with a different number of units in the hidden layers: 2500-625, 2500-1250 and 2500-2500.
Each network is trained and validated in the same way as described in the previous section.

First, we evaluated the dropout noise scheme by training the networks with a fixed hidden noise level
of 0.5 and the input noise level varying from 0.0 to 0.7 in increments of 0.1 for each experiment.
In the second experiment, we fixed the input noise level at 0.2 and varied the hidden noise level
from 0.0 to 0.7, again in increments of 0.1. In the final set of experiments, we used the random
dropout noise scheme with the same noise level at the input and hidden layers. The noise level in
this case is a range [0, x] where x is varied from 0.0 to 0.8 in increments of 0.1. The classification
performance for all the experiments on both datasets is reported in Figure 6.

Random dropout improves the performance of the models over dropout with no additional computational
cost.


6 Related Work 

To the best of our knowledge there is no work analyzing dropout from a data augmentation perspective.
Nonetheless, there is a plethora of excellent works about dropout, some describing its
regularization properties and others developing new kinds of noise schemes based on different intuitions.

Regularization properties of noise have been known for more than a decade. Bishop (1995) showed
for instance that the regularization term induced by noise belongs to the class of generalized
Tikhonov regularizers for sum-of-squares error functions. More recently, Baldi & Sadowski (2013)
proved that the dropout noise scheme specifically applies a regularization term very similar to the usual
weight decay.

Figure 6: Error percentages on the MNIST and CIFAR-10 datasets using MLP architectures trained
with different corruption schemes. Row 1: experiments on CIFAR-10. Row 2: experiments
on MNIST. Column 1: dropout with varying input noise and a fixed hidden noise of 0.5.
Column 2: dropout with varying hidden noise and a fixed input noise of 0.2. Column 3:
random dropout with a varying noise range [0, x] used at the hidden and input layers.


Focusing on uncertainty rather than regularization, Gal & Ghahramani (2015) gave a very inspiring
interpretation of dropout as a Bayesian approximation.

Different noise schemes are used by Poole et al. (2014) and applied at different positions in autoencoders:
input, pre-activation and post-activation. They report better results than denoising
autoencoders and also report that Gaussian noise yields better results than dropout on the MNIST
classification task.

Bachman et al. (2014) emphasize the bagging interpretation of dropout and propose a generalization
called pseudo-ensembles and a related regularizer which makes it possible to train semi-supervised
networks.

Some recent work reports that noise level schedules, inspired by simulated annealing, help in supervised
and unsupervised tasks (Geras & Sutton, 2014; Chandra & Sharma, 2014; Rennie et al., 2014).
We propose an alternative that avoids a schedule and instead uses random noise levels such that the
model cannot adapt to a slowly changing noise distribution.

A similar approach of sampling the noise level was used in Geras & Sutton (2014) in the context
of unsupervised learning using an autoencoder (on the input, not on the features). However, they show
that the approach is not very useful in their case.

Finally, work by Graham et al. (2015) is related to random noise as both their submatrix multiplications
and the random noise level p induce an independence between neurons of a single layer. They
found that the independence is not too damaging if enough different submatrix patterns are used.

7 Conclusion 

We have presented and justified a novel interpretation of dropout as prior-knowledge-free data augmentation.
We described a new procedure to generate samples by back-projecting the dropout noise
into the input space. Our results suggest neural networks can be trained without dropout on such
noisy samples and still yield good results. Nonetheless, experiments should be performed on larger
networks in order to determine whether this observation is just a particular property of relatively
small networks. Furthermore, trained networks should be analyzed to determine if co-adaptation is
still avoided when using per-layer noise back-projection on deep neural networks.

Since we presented only random dropout, the list of possible substitutes for dropout in this work is far from
exhaustive. As described in Section 4, the important knobs for modifying the data augmentation induced by
noise are the model’s features and the noise scheme applied to them. Using a semi-supervised cost can
influence the implicit transformations by forcing the network to learn more general features. A
network could also be trained on x* samples generated from another network, similarly to generative
adversarial networks (Goodfellow et al., 2014a).


8 Appendix

8.1 Proof of x* unlikeliness 

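We can show, with a proof by contradiction, that it is unlikely to find a single $x^* = x^{(1)*} = \ldots = x^{(n)*}$
that minimizes $L$ well.

By the associative property of function composition, we can rewrite Equation 7 as

$$\tilde{f}^{(i)}(x^*) = \tilde{a}^{(i)} \circ h^{(i)}\left(\tilde{f}^{(i-1)}(x^*)\right) \tag{11}$$

Suppose there exists an $x^*$ such that

$$a^{(i)} \circ h^{(i)}\left(f^{(i-1)}(x^*)\right) = \tilde{a}^{(i)} \circ h^{(i)}\left(\tilde{f}^{(i-1)}(x)\right) \tag{12}$$

$$a^{(i-1)} \circ h^{(i-1)}\left(f^{(i-2)}(x^*)\right) = \tilde{a}^{(i-1)} \circ h^{(i-1)}\left(\tilde{f}^{(i-2)}(x)\right) \tag{13}$$

Based on Equations 11 and 13, we have that $f^{(i-1)}(x^*) = \tilde{f}^{(i-1)}(x)$. The proof is concluded by
substituting the latter into Equation 12 and then expanding the composed functions:

$$a^{(i)} \circ h^{(i)}\left(f^{(i-1)}(x^*)\right) = \tilde{a}^{(i)} \circ h^{(i)}\left(f^{(i-1)}(x^*)\right)$$

$$\mathrm{rect}\left(h^{(i)}\left(f^{(i-1)}(x^*)\right)\right) = M^{(i)} \odot \mathrm{rect}\left(h^{(i)}\left(f^{(i-1)}(x^*)\right)\right) \tag{14}$$

Equation 14 can only be true if $M^{(i)}$ does not apply any modification to
$\mathrm{rect}\left(h^{(i)}\left(f^{(i-1)}(x^*)\right)\right)$, that means $M^{(i)}_j = 1$ when
$\mathrm{rect}_j\left(h^{(i)}\left(f^{(i-1)}(x^*)\right)\right) > 0$. This happens with probability
$\left(p^{(i)}\right)^{d^{(i)} s^{(i)}}$, where $p^{(i)}$ is the Bernoulli success probability, $d^{(i)}$ is the number
of hidden units and $s^{(i)}$ is the mean sparsity level, i.e. the mean percentage of active hidden units, of the
$i$-th hidden layer. This probability is very low for standard hyper-parameter values. For instance, with
$p^{(i)} = 0.5$, $d^{(i)} = 1000$ and $s^{(i)} = 0.15$, the probability is on the order of $10^{-45}$.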
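
The order of magnitude in this example can be checked numerically. The short sketch below assumes
that the probability in question is p raised to the number of active units, i.e. that each of the
d·s active units must be left untouched independently with probability p; the function name is
hypothetical.

```python
import math

def prob_mask_leaves_active_units(p, d, s):
    """Probability that all d*s active units are kept, each independently with probability p."""
    n_active = round(d * s)
    return p ** n_active

# Values from the example: p = 0.5, d = 1000 hidden units, mean sparsity s = 0.15.
prob = prob_mask_leaves_active_units(p=0.5, d=1000, s=0.15)
log10_prob = math.log10(prob)   # base-10 exponent of the probability
```

The value of `log10_prob` gives the exponent directly and can be recomputed for other layer sizes or
sparsity levels.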